core-clp: Add support for decompressing a specific file split from a clp archive into one or more IR files. #417
Conflicts: components/core/src/clp/GlobalMySQLMetadataDB.cpp
Co-authored-by: Lin Zhihao <[email protected]>
Commit message suggestion:
core-clp: Add support for decompressing an IR from a specific file split from a clp archive.
I've done my part of the review; maybe you can take it over from here, @kirkrodrigues.
Forgot to mention: since we now have a log event serializer, can we add some unit tests to test serialization + deserialization?
Co-authored-by: kirkrodrigues <[email protected]>
{
    SPDLOG_ERROR(
            "Failed to create directory structure {}, errno={}",
            output_dir.c_str(),
- output_dir.c_str(),
+ temp_output_dir.c_str(),
if (false == res) {
    close_writer();
    return true;
}

m_is_open = true;

// Flush the preamble
flush();

return false;
- if (false == res) {
-     close_writer();
-     return true;
- }
- m_is_open = true;
- // Flush the preamble
- flush();
- return false;
+ if (false == res) {
+     close_writer();
+     return false;
+ }
+ m_is_open = true;
+ // Flush the preamble
+ flush();
+ return true;
We should return false to indicate an error, right?
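The convention this suggestion asks for can be sketched roughly as below. Everything here is a hypothetical stand-in for illustration (`SketchWriter`, `write_preamble`, and the "empty path fails" rule are invented, not the PR's actual writer); the only point is that `open()` returns `false` on failure and `true` on success, so callers can test the result directly.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the writer under review; not the PR's class.
class SketchWriter {
public:
    // Returns true on success and false on failure.
    [[nodiscard]] auto open(std::string const& path) -> bool {
        assert(false == m_is_open);  // Precondition: not already open

        bool const res = write_preamble(path);
        if (false == res) {
            close_writer();
            return false;
        }

        m_is_open = true;
        return true;
    }

    [[nodiscard]] auto is_open() const -> bool { return m_is_open; }

private:
    // Assumption for the sketch: writing the preamble fails for an empty path.
    static auto write_preamble(std::string const& path) -> bool {
        return false == path.empty();
    }

    auto close_writer() -> void { m_is_open = false; }

    bool m_is_open{false};
};
```

With this shape, a caller would write `if (false == writer.open(path)) { /* handle error */ }`, which matches the `false == res` style already used in the snippet under review.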
}
begin_message_ix = end_message_ix;

if (auto const error_code = ir_serializer.open(temp_ir_path.string());
not error_code
LogEventSerializer<four_byte_encoded_variable_t> ir_serializer;
// Open output IR file
if (auto const error_code = ir_serializer.open(temp_ir_path.string());
not error_code
For the PR title, how about:
core-clp: Add support for decompressing a specific file split from a clp archive into one or more IR files.
References
Description
The change is motivated by the need to support log viewer, which
This PR introduces a new decompression interface, decompress_ir, that decompresses a file split into one or more IRs. The function takes an original file ID, a specific message index, and a threshold.
It first finds the file split that contains the message index, then decompresses the split into one or more IRs; the function creates a new IR whenever the current IR's raw size (i.e., not zstd-compressed) is greater than the given threshold.
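The splitting rule can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `split_by_raw_size` is a hypothetical helper, each message's raw size is approximated by its byte length (a real IR also carries timestamps and encoding overhead), and the returned pairs are half-open `[begin_message_ix, end_message_ix)` ranges.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns the [begin, end) message-index range covered
// by each IR that would be produced for one file split.
auto split_by_raw_size(
        std::vector<std::string> const& messages,
        std::size_t begin_message_ix,
        std::size_t threshold
) -> std::vector<std::pair<std::size_t, std::size_t>> {
    assert(threshold > 0);

    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    std::size_t const split_end_ix = begin_message_ix + messages.size();
    std::size_t ir_begin_ix = begin_message_ix;
    std::size_t ir_raw_size = 0;

    for (std::size_t i = 0; i < messages.size(); ++i) {
        // Assumption: raw size is just the message's byte length.
        ir_raw_size += messages[i].size();
        std::size_t const end_message_ix = begin_message_ix + i + 1;

        // Close the current IR once its raw size exceeds the threshold.
        if (ir_raw_size > threshold) {
            ranges.emplace_back(ir_begin_ix, end_message_ix);
            ir_begin_ix = end_message_ix;
            ir_raw_size = 0;
        }
    }

    // Whatever remains below the threshold forms the last IR.
    if (ir_begin_ix < split_end_ix) {
        ranges.emplace_back(ir_begin_ix, split_end_ix);
    }
    return ranges;
}
```

Because an IR is closed only after the threshold is exceeded, each IR except the last ends up slightly larger than the threshold rather than strictly under it.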
Each IR follows the IRv1 format, meaning it has a complete preamble and an EoF byte and can be deserialized individually.
The generated IRs use the naming format: <FILE_ORIG_ID><begin_message_ix><end_message_ix>.clp.zst. Since the IRv1 preamble doesn't contain any log event index information, this name is essential for a consumer of the IR to know the range of log event indices the IR contains.
The PR also introduces a new class, LogEventSerializer (in LogEventSerializer.cpp), that serializes a plain-text message into the IR format. Due to limitations of our current IR-related encoding APIs, the class is designed with two inefficiencies.
We agreed that these two are acceptable, as properly supporting the flow will take more thought on reworking the encoder interface.
Validation performed
The validation was not performed directly on this PR, but on a follow-up PR that adds decompress_ir to the execution path of the clp executable. To validate the functionality, we compressed a 64MB file into archive(s). We then decompressed it into multiple IRs, decoded and concatenated them, and did a binary comparison with the original file.
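The final comparison step could look roughly like the sketch below. It is not the script used for this validation: `read_binary` and `concatenation_matches_original` are hypothetical helpers, the inputs are assumed to be already-decoded plain-text files listed in message order, and everything is read into memory, which is fine for a 64MB file but not for much larger ones.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Reads a whole file as raw bytes; throws if the file can't be opened.
auto read_binary(std::filesystem::path const& path) -> std::string {
    std::ifstream in{path, std::ios::binary};
    if (false == in.is_open()) {
        throw std::runtime_error("Failed to open " + path.string());
    }
    return std::string(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
}

// Returns true if the decoded parts, concatenated in the given order, are
// byte-for-byte identical to the original file.
auto concatenation_matches_original(
        std::vector<std::filesystem::path> const& decoded_parts,
        std::filesystem::path const& original
) -> bool {
    assert(false == decoded_parts.empty());

    std::string concatenated;
    for (auto const& part : decoded_parts) {
        concatenated += read_binary(part);
    }
    return concatenated == read_binary(original);
}
```

The order of `decoded_parts` matters: the parts have to be passed in ascending message-index order, which is exactly the information the IR file names carry.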
We used two configurations to cover all the possible cases:
Compressed a 64MB hadoop log using a smaller encoded file size and archive size, such that the original file is split into 3 splits across 2 archives. We then decompressed all 3 IRs by running clp 3 times, using a different message index each time.
Compressed the 64MB hadoop log using default settings, so only one file and one archive were generated. We then decompressed the IR using a 32MB threshold, generating 3 IRs on disk.